Project 3 -- PCA
Part 1: Getting started
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

data_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/Sabaae/Dataset/main/domestic_product.csv',
    index_col=0
)

def plot_time_series(df, countries):
    fig, axes = plt.subplots(len(countries), 1, figsize=(10, len(countries) * 5))
    for i, country in enumerate(countries):
        country_data = df.loc[country]
        axes[i].plot(country_data.index, country_data.values)
        axes[i].set_title(f"Time-series for {country}")
        axes[i].set_xlabel("Year")
        axes[i].set_ylabel("Gross Domestic Product")
        years = country_data.index
        axes[i].set_xticks(years[::5])
    plt.show()

countries_list = [" Albania ", " Greece ", " Somalia ", " Sweden ", " Oman ", " Italy "]

print("Original Plots: ")
plot_time_series(data_raw, countries_list)

scaler = StandardScaler()
data_standardized = data_raw.copy()
data_standardized[:] = scaler.fit_transform(data_standardized)

print("Standardized Plots: ")
plot_time_series(data_standardized, countries_list)
Original Plots:
Standardized Plots:
The standardized time-series show each country's GDP relative to the others, since every year (column) is scaled to mean 0 and standard deviation 1. Unlike the original plots, which show absolute GDP growth, the standardized curves reflect relative changes and therefore fluctuate up and down. Most countries show a slightly decreasing trend, with lower values in 2020 than in 1970. Oman is an exception, rising slightly from -0.191 in 1970 to -0.189 in 2020, which indicates that its GDP grew faster than the cross-country average over the period. Conversely, a falling curve means a country's GDP grew more slowly than the others'. A positive value means the country's GDP in a given year is above the mean GDP of all countries for that year, while a negative value means it is below.
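As a quick illustration of what the standardization does, here is a minimal sketch with toy data (not the GDP dataset): StandardScaler scales each column, i.e. each year, to mean 0 and standard deviation 1 across the rows (countries).

```python
# Minimal sketch with toy data: StandardScaler standardizes each
# column (year) across rows (countries).
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0],
                [3.0, 30.0],
                [5.0, 50.0]])   # 3 "countries" x 2 "years"

toy_std = StandardScaler().fit_transform(toy)

print(toy_std.mean(axis=0))  # each column now has mean ~0
print(toy_std.std(axis=0))   # and (population) standard deviation ~1
```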
Part 2: Applying PCA
- Compute the covariance matrix of the dataframe. *Hint: the dimensions of your covariance matrix should be (52, 52).*
- Write a function `get_sorted_eigen(df_cov)` that takes the covariance matrix of dataframe `df` (from step 1) and returns sorted eigenvalues and eigenvectors using `np.linalg.eigh`.
- Show the effectiveness of your principal components in covering the variance of the dataset with a scree plot.
- How many PCs do you need to cover 99.9% of the dataset's variance?

The cumulative variance reaches 99.9% at the fifth entry of the printed `cumExpVar` (99.91% exceeds 99.9%), so 5 principal components are required to cover 99.9% of the dataset's variance.

- Plot the first 16 principal components (eigenvectors) as a time series (16 subplots; on the x-axis you have dates and on the y-axis the value of the PC element).
- Compare the first few PCs with the rest of them. Do you see any difference in their trend?

The first few PCs show clear trends that reveal the major patterns in the data. PC 1 decreases steadily from 0.139 in 1970 to 0.064 in 2020. PC 2 increases from 0.003 to 0.032. PC 3 shows a slight overall increase with small fluctuations. In contrast, the PCs from the fourth onward show no clear trend, moving up and down randomly over the period. This difference means that the first few PCs capture the key trends and changes over time, while the later PCs reflect smaller, less consistent variations.
import numpy as np

# 1. The covariance matrix of the dataframe
df_cov = np.cov(data_standardized.T)
print(df_cov.shape)

# 2. Function that takes the covariance matrix of df and returns sorted eigenvalues and eigenvectors
def get_sorted_eigen(df_cov):
    eigenValues, eigenVectors = np.linalg.eigh(df_cov)
    # np.linalg.eigh returns eigenvalues in ascending order, so sort in decreasing order
    args = (-eigenValues).argsort()
    eigenValues = eigenValues[args]
    eigenVectors = eigenVectors[:, args]
    return eigenValues, eigenVectors

# 3. Show effectiveness of principal components in covering the variance
eigenValues, eigenVectors = get_sorted_eigen(df_cov)
eigValSum = sum(eigenValues)
expVar = [eigV / eigValSum * 100 for eigV in eigenValues]
cumExpVar = np.cumsum(expVar)
print(cumExpVar)

plt.bar(range(10), expVar[:10], label='Explained Variance')
plt.plot(cumExpVar[:10], 'r-o', label='Cumulative Explained Variance')
plt.legend()
plt.show()

# 5. Plot the first 16 PCs as a time series
years = data_standardized.columns
fig, axes = plt.subplots(16, 1, figsize=(10, 16 * 3))
for i in range(16):
    pc_data = eigenVectors[:, i]
    axes[i].plot(years, pc_data)
    axes[i].set_title(f"Time-series for PC {i+1}")
    axes[i].set_xlabel("Year")
    axes[i].set_ylabel("PC Value")
    axes[i].set_xticks(years[::5])
plt.tight_layout()
plt.show()
(52, 52)
[ 95.31499014  98.95623653  99.67607837  99.83935344  99.90633114  99.94418134
  99.96358004  99.97863499  99.98407348  99.98819769  99.99105253  99.99334173
  99.99490079  99.99604911  99.99703884  99.99751032  99.9979238   99.99829534
  99.99860684  99.9988423   99.99903082  99.99921154  99.99936265  99.99947642
  99.99957731  99.99965223  99.99971498  99.9997618   99.99980548  99.99984546
  99.99987235  99.99989208  99.9999105   99.99992477  99.99993781  99.999947
  99.99995571  99.9999625   99.99996881  99.99997425  99.99997844  99.99998221
  99.99998542  99.99998845  99.99999124  99.99999335  99.999995    99.9999964
  99.9999977   99.999999    99.99999962 100.        ]
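The component count can also be computed rather than read off the printed array. A small standalone sketch; the percentages below are a toy spectrum for illustration, not the dataset's actual explained variances:

```python
# Sketch: find the smallest number of PCs whose cumulative explained
# variance reaches a threshold (toy percentages, for illustration only).
import numpy as np

exp_var = np.array([95.0, 3.6, 0.9, 0.3, 0.15, 0.05])  # already in percent
cum_exp_var = np.cumsum(exp_var)

# np.argmax returns the first index where the condition is True
n_pcs = int(np.argmax(cum_exp_var >= 99.9)) + 1
print(n_pcs)
```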
Part 3: Data reconstruction
Create a function that:
- Accepts a country and the original dataset as inputs.
- Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
- Plots 4 figures:
- The original time-series for the specified country.
- The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot.
You should at least show 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
- Reconstruction with only PC1
- Reconstruction with both PC1 and PC2
- Reconstruction with PC1 to PC4 (First 4 PCs)
- Reconstruction with PC1 to PC8 (First 8 PCs)
- Reconstruction with PC1 to PC16 (First 16 PCs)
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized) form using the StandardScaler `inverse_transform` help...
- The residual error for your best reconstruction with respect to the original time-series.
- Hint: You are plotting the error that we have for reconstructing each year
(df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error.
- The RMSE of the reconstruction as a function of the number of included components (x-axis is the number of components and y-axis is the RMSE). Sweep the x-axis from 1 to 10 (this part is independent of part 3.2).

Test your function using the three countries you selected in Part 1.3 plus Greece, Somalia, and Italy as inputs.
from sklearn.metrics import mean_squared_error

def compute_recon(X_std, PC_count, eigenVectors):
    # project onto the first PC_count eigenvectors, then project back
    projX = np.dot(X_std, eigenVectors[:, 0:PC_count])
    reconX = np.dot(projX, eigenVectors[:, 0:PC_count].T)
    return reconX

def plot_country_figures(original_df, country_name):
    country_data = original_df.loc[country_name]
    # transform (not fit_transform) a one-row DataFrame so the feature names
    # match the fitted scaler and the "X does not have valid feature names"
    # warning is avoided
    country_data_std = scaler.transform(country_data.to_frame().T)
    years = country_data.index

    # Figure 1: original time-series for the specified country
    plt.figure(figsize=(10, 5))
    plt.plot(years, country_data.values)
    plt.title(f"Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.xticks(years[::5])
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(10, 5))
    for PC_count in PC_counts:
        recon_std = compute_recon(country_data_std, PC_count, eigenVectors)
        recon_original = scaler.inverse_transform(recon_std.reshape(1, -1)).flatten()
        plt.plot(original_df.columns, recon_original, label=f"Reconstruction with {PC_count} PCs")
    # plt.plot(original_df.columns, country_data, label="Original Data")
    plt.title(f"Incremental Reconstruction of Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon_std = compute_recon(country_data_std, 16, eigenVectors)
    best_recon_original = scaler.inverse_transform(best_recon_std.reshape(1, -1)).flatten()
    residual_err = country_data - best_recon_original
    plt.figure(figsize=(10, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.xticks(years[::5])
    plt.show()

    # Figure 4: RMSE of the reconstruction vs. number of included components
    rmse_list = []
    for i in range(1, 11):
        recon_std = compute_recon(country_data_std, i, eigenVectors)
        recon_original = scaler.inverse_transform(recon_std.reshape(1, -1)).flatten()
        rmse = np.sqrt(mean_squared_error(country_data, recon_original))
        rmse_list.append(rmse)
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
new_countries = [" Greece ", " Somalia ", " Italy ", " Viet Nam ", " Zambia ", " Zimbabwe "]
for country in new_countries:
    plot_country_figures(data_raw, country)
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
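As a standalone sanity check on the project-and-reconstruct pattern that `compute_recon` implements (with random data, not the GDP dataset): because `np.linalg.eigh` returns an orthonormal eigenvector basis, keeping all components reconstructs the input exactly.

```python
# Standalone check: projecting onto ALL eigenvectors of a covariance
# matrix and projecting back reconstructs the data exactly, since
# np.linalg.eigh returns an orthonormal basis.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))        # 20 samples, 5 features
_, vecs = np.linalg.eigh(np.cov(X.T))   # orthonormal eigenvectors

recon = (X @ vecs) @ vecs.T             # project onto all PCs, project back
print(np.allclose(recon, X))            # True: nothing is lost at full rank
```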
Part 4: SVD
Modify your code in part 3 to use SVD instead of PCA.
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison, and comment on the results.
SVD does not depend on variance or the covariance matrix. Instead, it breaks the data matrix into singular values and vectors, working directly with raw data. This makes SVD suitable for dimensionality reduction without needing standardization.
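A standalone sketch of this point: SVD factors the data matrix directly, with no mean removal or covariance matrix involved, and the factors multiply back to the original matrix.

```python
# Standalone sketch: SVD decomposes the raw data matrix itself, and
# U * diag(S) * Vt recovers it exactly.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_back = U @ np.diag(S) @ Vt

print(np.allclose(X_back, X))  # True
```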
The three plots compare PCA and SVD applied to the standardized data. In the first plot, both the PCA and SVD reconstructions closely match the actual GDP values over time, indicating that both methods capture the main trends in the data; the two curves overlap, showing that the reconstructions are identical.
In the second plot, which displays the residual errors, the PCA and SVD curves are also the same. This suggests that both methods approximate the data equally well, with the same accuracy in capturing the fluctuations in GDP.
The third plot shows the RMSE of the reconstructions as the number of principal components increases. Both PCA and SVD exhibit the same decline in RMSE with more components, indicating that both improve in accuracy as components are added.
This agreement is expected because the data are scaled so that each feature has mean 0 and standard deviation 1. Applying SVD to the standardized data performs essentially the same computation as PCA: both decompose the covariance structure of the standardized data, leading to identical results.
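This equivalence can be checked numerically. A standalone sketch with random standardized data: the PCA eigenvalues of the covariance matrix equal s_i^2 / (n - 1) from the SVD, and the right singular vectors match the eigenvectors up to sign.

```python
# Standalone check of the PCA/SVD equivalence on standardized data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize columns

eig_vals, eig_vecs = np.linalg.eigh(np.cov(X.T))
eig_vals = eig_vals[::-1]                     # eigh sorts ascending

U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(eig_vals, S**2 / (X.shape[0] - 1)))        # True
print(np.allclose(np.abs(eig_vecs[:, ::-1]), np.abs(Vt.T)))  # True (up to sign)
```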
from numpy.linalg import svd

def plot_country_figures(original_df, country_name):
    country_data = original_df.loc[country_name]
    years = country_data.index

    # apply SVD directly to the raw data (no standardization needed)
    U, S, Vt = svd(original_df, full_matrices=False)
    V = Vt.T

    # Figure 1: original time-series for the specified country
    plt.figure(figsize=(10, 5))
    plt.plot(years, country_data.values)
    plt.title(f"Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.xticks(years[::5])
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(10, 5))
    for PC_count in PC_counts:
        recon = compute_recon(country_data.values.reshape(1, -1), PC_count, V)
        plt.plot(original_df.columns, recon.flatten(), label=f"Reconstruction with {PC_count} PCs")
    plt.title(f"Incremental Reconstruction of Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon = compute_recon(country_data.values.reshape(1, -1), 16, V)
    residual_err = country_data.values - best_recon.flatten()
    plt.figure(figsize=(10, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.xticks(years[::5])
    plt.show()

    # Figure 4: RMSE of the reconstruction vs. number of included components
    rmse_list = []
    for i in range(1, 11):
        recon = compute_recon(country_data.values.reshape(1, -1), i, V)
        rmse = np.sqrt(mean_squared_error(country_data.values, recon.flatten()))
        rmse_list.append(rmse)
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
for country in new_countries:
    plot_country_figures(data_raw, country)
def plot_pca_svd_comparison(original_df, country_name):
    country_data = original_df.loc[country_name]
    # reuse the scaler fitted on the full dataset; do not refit on a single country
    country_data_std = scaler.transform(country_data.to_frame().T)
    years = country_data.index

    # apply SVD to the standardized data for comparison with PCA
    U, S, Vt = svd(data_standardized, full_matrices=False)
    V = Vt.T

    # plot best reconstruction pca vs svd (PC=16)
    best_recon_pca_std = compute_recon(country_data_std, 16, eigenVectors)
    best_recon_pca = scaler.inverse_transform(best_recon_pca_std).flatten()
    best_recon_svd_std = compute_recon(country_data_std, 16, V)
    best_recon_svd = scaler.inverse_transform(best_recon_svd_std).flatten()
    plt.figure(figsize=(10, 5))
    plt.plot(years, best_recon_pca, label="PCA Reconstruction", linestyle="--")
    plt.plot(years, best_recon_svd, label="SVD Reconstruction", linestyle=":")
    # plt.plot(years, country_data.values, label="Original Data")
    plt.title(f"Best Reconstruction PCA vs SVD for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # plot residual errors for best reconstruction for pca vs svd
    re_pca = country_data.values - best_recon_pca
    re_svd = country_data.values - best_recon_svd
    plt.figure(figsize=(10, 5))
    plt.plot(years, re_pca, label="PCA Residual", linestyle="--")
    plt.plot(years, re_svd, label="SVD Residual", linestyle=":")
    plt.title(f"Residual Errors for Best Reconstruction for PCA vs SVD for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # plot RMSE of reconstruction for pca vs svd
    rmse_pca = []
    rmse_svd = []
    for i in range(1, 11):
        recon_pca_std = compute_recon(country_data_std, i, eigenVectors)
        recon_pca = scaler.inverse_transform(recon_pca_std).flatten()
        rmse_pca.append(np.sqrt(mean_squared_error(country_data.values, recon_pca)))
        recon_svd_std = compute_recon(country_data_std, i, V)
        recon_svd = scaler.inverse_transform(recon_svd_std).flatten()
        rmse_svd.append(np.sqrt(mean_squared_error(country_data.values, recon_svd)))
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_pca, label="PCA RMSE", linestyle="--")
    plt.plot(range(1, 11), rmse_svd, label="SVD RMSE", linestyle=":")
    plt.title(f"RMSE of Reconstruction for PCA vs SVD for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.legend()
    plt.xticks(range(1, 11))
    plt.show()

# plot comparison figures
for country in new_countries:
    plot_pca_svd_comparison(data_raw, country)
Part 5: Let's collect another dataset!
Create another dataset similar to the dataset in part 1, this time using the raw information on average monthly electricity sales across various states of the United States from 2001 to 2024 here.
You need to manipulate the data to organize it in the desired format (i.e., the same format that was in previous parts). Missing values were removed such that if there was a missing value for the average electricity sales of a particular state at a given date, that date has been completely removed from the dataset, even if the data of that specific date existed for other states.
You are free to use any tools you like, from Excel to Python! In the end, you should have a new CSV file similar to the previous dataset. How many features does the final dataset have? How many cities are there?
There are 277 features and 62 unique cities in train.csv.
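A hedged sketch of the reshaping step. The column names (`state`, `date`, `sales`) and toy rows here are assumptions for illustration; the real EIA export has its own headers:

```python
# Assumed column names ('state', 'date', 'sales') and toy rows: pivot the
# long table into a wide states-by-dates table, then drop every date
# column that contains a missing value, as required above.
import pandas as pd

long_df = pd.DataFrame({
    "state": ["New York", "New York", "Utah", "Utah", "Wyoming", "Wyoming"],
    "date":  ["2001-01",  "2001-02",  "2001-01", "2001-02", "2001-01", None],
    "sales": [100.0, 110.0, 40.0, 42.0, 20.0, 21.0],
})

wide = long_df.pivot_table(index="state", columns="date", values="sales")
wide = wide.dropna(axis=1)   # a date with any missing value is removed entirely

print(wide.shape)            # (3, 1): only 2001-01 survives
```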
Upload your new dataset (in CSV format) to your colab notebook, repeat part 4 for this dataset, and comment on the results. When analyzing the states, use New York, Utah, and three other states with the closest alphabetical names to your first name.
The best-reconstruction plot for PCA vs SVD shows the reconstructed average electricity sales over time. The PCA and SVD curves oscillate together and overlap, meaning both capture the cyclical patterns in the data, particularly the seasonal spikes in specific months. These repeating peaks imply a seasonal trend in electricity usage, likely aligned with weather changes.
The residual error plot emphasizes the consistency between PCA and SVD, as the two curves are identical. Although there are periodic errors, possibly due to seasonality, both methods struggle equally, confirming their similar performance in capturing the underlying sales patterns. Overall, the plots highlight the regular seasonal fluctuations in each state's electricity data.
For the original data, the RMSE decreases under SVD as more components capture meaningful variance, improving accuracy. For the standardized data, the RMSE increases with more components, possibly because equal feature weighting lets the later components fit noise. The trend on the original data is typically more reliable, reflecting the true variance.
The code below helps you to upload your new CSV file to your colab session.
# load train.csv to Google Colab
from google.colab import files
uploaded = files.upload()
Saving train.csv to train.csv
city_data = pd.read_csv("train.csv", index_col=0)

def plot_city_figures(original_df, city_name):
    city_data = original_df.loc[city_name]
    years = city_data.index

    # apply SVD
    U, S, Vt = svd(original_df, full_matrices=False)
    V = Vt.T

    # Figure 1: original time-series for the specified city
    plt.figure(figsize=(20, 5))
    plt.plot(years, city_data.values)
    plt.title(f"Original Time-series for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(20, 5))
    for PC_count in PC_counts:
        recon = compute_recon(city_data.values.reshape(1, -1), PC_count, V)
        plt.plot(original_df.columns, recon.flatten(), label=f"Reconstruction with {PC_count} PCs")
    plt.title(f"Incremental Reconstruction of Original Time-series for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon = compute_recon(city_data.values.reshape(1, -1), 16, V)
    residual_err = city_data.values - best_recon.flatten()
    plt.figure(figsize=(20, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Residual Error")
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 4: RMSE of the reconstruction vs. number of included components
    rmse_list = []
    for i in range(1, 11):
        recon = compute_recon(city_data.values.reshape(1, -1), i, V)
        rmse = np.sqrt(mean_squared_error(city_data.values, recon.flatten()))
        rmse_list.append(rmse)
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {city_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
cities = ["New York", "Utah", "West Virginia", "Wisconsin", "Wyoming"]
for city in cities:
    plot_city_figures(city_data, city)
def plot_pca_svd_comparison(original_df, city_name):
    city_data = original_df.loc[city_name]
    years = city_data.index

    # standardize the full dataset first, then transform the single city row
    # with the same scaler (fit on the whole table, not on one city)
    data_standardized = original_df.copy()
    data_standardized[:] = scaler.fit_transform(data_standardized)
    city_data_std = scaler.transform(city_data.to_frame().T)

    # apply SVD to the standardized data for comparison with PCA
    U, S, Vt = svd(data_standardized, full_matrices=False)
    V = Vt.T

    # PCA eigenvectors for this dataset
    df_cov = np.cov(data_standardized.T)
    eigenValues, eigenVectors = get_sorted_eigen(df_cov)

    # plot best reconstruction pca vs svd (PC=16)
    best_recon_pca_std = compute_recon(city_data_std, 16, eigenVectors)
    best_recon_pca = scaler.inverse_transform(best_recon_pca_std).flatten()
    best_recon_svd_std = compute_recon(city_data_std, 16, V)
    best_recon_svd = scaler.inverse_transform(best_recon_svd_std).flatten()
    plt.figure(figsize=(20, 5))
    plt.plot(years, best_recon_pca, label="PCA Reconstruction", linestyle="--")
    plt.plot(years, best_recon_svd, label="SVD Reconstruction", linestyle=":")
    # plt.plot(years, city_data.values, label="Original Data")
    plt.title(f"Best Reconstruction PCA vs SVD for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # plot residual errors for best reconstruction for pca vs svd
    re_pca = city_data.values - best_recon_pca
    re_svd = city_data.values - best_recon_svd
    plt.figure(figsize=(20, 5))
    plt.plot(years, re_pca, label="PCA Residual", linestyle="--")
    plt.plot(years, re_svd, label="SVD Residual", linestyle=":")
    plt.title(f"Residual Errors for Best Reconstruction for PCA vs SVD for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Residual Error")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # plot RMSE of reconstruction for pca vs svd
    rmse_pca = []
    rmse_svd = []
    for i in range(1, 11):
        recon_pca_std = compute_recon(city_data_std, i, eigenVectors)
        recon_pca = scaler.inverse_transform(recon_pca_std).flatten()
        rmse_pca.append(np.sqrt(mean_squared_error(city_data.values, recon_pca)))
        recon_svd_std = compute_recon(city_data_std, i, V)
        recon_svd = scaler.inverse_transform(recon_svd_std).flatten()
        rmse_svd.append(np.sqrt(mean_squared_error(city_data.values, recon_svd)))
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_pca, label="PCA RMSE", linestyle="--")
    plt.plot(range(1, 11), rmse_svd, label="SVD RMSE", linestyle=":")
    plt.title(f"RMSE of Reconstruction for PCA vs SVD for {city_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.legend()
    plt.xticks(range(1, 11))
    plt.show()

# plot comparison figures
for city in cities:
    plot_pca_svd_comparison(city_data, city)
References
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
PCA:
U.S. Electricity Prices Data: